Heart Disease Dataset

In this article, we demonstrate how to solve a classification problem with TensorFlow Estimators, using the Heart Disease Dataset from the UCI Machine Learning Repository.

Attribute Information:

  1. Age
  2. Sex
    • 0: Female
    • 1: Male
  3. Chest Pain Type
    • 1: Typical Angina
    • 2: Atypical Angina
    • 3: Non-Anginal Pain
    • 4: Asymptomatic
  4. Serum Cholesterol (in mg/dl)
  5. FBS: Fasting Blood Sugar > 120 mg/dl
    • 0: False
    • 1: True
  6. Resting Electrocardiographic Results
    • 0: Normal
    • 1: Having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • 2: Showing probable or definite left ventricular hypertrophy by Estes' criteria
  7. Maximum Heart Rate Achieved
  8. Exercise Induced Angina
    • 0: No
    • 1: Yes
  9. Oldpeak: ST Depression Induced by Exercise Relative to Rest
  10. Slope: The Slope of the Peak Exercise ST Segment
    • 1: Upsloping
    • 2: Flat
    • 3: Downsloping
  11. Number of Major Vessels (0-3) Colored by Fluoroscopy
  12. Thal
    • 3: Normal
    • 6: Fixed Defect
    • 7: Reversible Defect

Variable to be predicted

Problem Description

The goal is to develop a model that predicts whether heart disease is present or absent, based on the remaining features.

X and y sets

Train and Test sets

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
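The stratification property can be checked on a small example. The sketch below assumes scikit-learn's StratifiedKFold; the toy arrays X and y are hypothetical stand-ins for the heart disease features and target, and taking the first fold as the train/test split is an assumption, not necessarily the procedure used in this article.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 20 samples, 30% of them in the positive class.
X = np.arange(40).reshape(20, 2)
y = np.array([1] * 6 + [0] * 14)

# Two stratified folds; each test fold keeps the 30% positive rate.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
folds = list(skf.split(X, y))

# Assumption: use the first fold as a single train/test split.
train_idx, test_idx = folds[0]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print("positive rate, full set:", y.mean())
print("positive rate, train   :", y_train.mean())
print("positive rate, test    :", y_test.mean())
```

With 6 positive and 14 negative samples split into two folds, each fold receives 3 positives and 7 negatives, so the class ratio of the full set is preserved in both the train and test parts.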

Input Function

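The input function specifies how data is converted to a tf.data.Dataset that feeds the input pipeline in a streaming fashion. It returns a tf.data.Dataset object that yields the following two-element tuple:

  • features: a Python dictionary in which each key is the name of a feature and each value is an array containing all of that feature's values
  • label: an array containing the label of every example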
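A minimal sketch of such an input function, assuming TensorFlow 2.x with eager execution. The helper name make_input_fn, its parameters, and the toy features dictionary are all hypothetical and only illustrate the (features, label) structure.

```python
import numpy as np
import tensorflow as tf

def make_input_fn(features, labels, n_epochs=None, shuffle=True, batch_size=32):
    """Return a zero-argument function that builds a tf.data.Dataset.

    `features` maps feature names to arrays; `labels` is an array of targets.
    (Hypothetical helper, not taken from the article.)
    """
    def input_fn():
        # Each dataset element is a (features_dict, label) pair.
        ds = tf.data.Dataset.from_tensor_slices((dict(features), labels))
        if shuffle:
            ds = ds.shuffle(buffer_size=len(labels))
        # n_epochs=None repeats indefinitely, which is common for training.
        return ds.repeat(n_epochs).batch(batch_size)
    return input_fn

# Hypothetical toy data with two of the attributes listed above.
features = {
    "age": np.array([63.0, 41.0, 57.0, 50.0], dtype=np.float32),
    "sex": np.array([1, 0, 1, 0], dtype=np.int64),
}
labels = np.array([1, 0, 1, 0], dtype=np.int64)

# One pass over the data, in order, in batches of two.
eval_input_fn = make_input_fn(features, labels, n_epochs=1, shuffle=False, batch_size=2)
batch_features, batch_labels = next(iter(eval_input_fn()))
print(sorted(batch_features.keys()), batch_labels.numpy())
```

Returning a closure rather than the dataset itself lets the estimator call the function inside its own graph, which is why separate input functions are usually built for training and evaluation.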

An estimator's input has two main parts: feature columns and a numeric vector. Feature columns describe how each entry of the input numeric vector should be interpreted. The following function separates the categorical and numerical columns (features) and returns a descriptive list of feature columns.
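One way such a function might look is sketched below. It assumes a TensorFlow version that still ships the (now deprecated) tf.feature_column module. The function name make_feature_columns, the categorical_names argument, and the toy data are hypothetical, and the article's own function may separate the columns differently.

```python
import tensorflow as tf

def make_feature_columns(data, categorical_names):
    """Build feature columns from a dict of column name -> list of values.

    Columns named in `categorical_names` are one-hot encoded over the
    values observed in the data; all other columns are treated as numeric.
    (Hypothetical helper, not taken from the article.)
    """
    feature_columns = []
    for name, values in data.items():
        if name in categorical_names:
            # Vocabulary = sorted distinct integer codes seen in this column.
            vocabulary = sorted(set(int(v) for v in values))
            categorical = tf.feature_column.categorical_column_with_vocabulary_list(
                name, vocabulary
            )
            feature_columns.append(tf.feature_column.indicator_column(categorical))
        else:
            feature_columns.append(
                tf.feature_column.numeric_column(name, dtype=tf.float32)
            )
    return feature_columns

# Hypothetical toy data using three of the attributes listed above.
data = {
    "age": [63, 41, 57],
    "sex": [1, 0, 1],
    "chest_pain_type": [1, 2, 4],
}
columns = make_feature_columns(data, categorical_names={"sex", "chest_pain_type"})
for column in columns:
    print(column.name)
```

Passing the categorical column names explicitly avoids guessing the type from the data, since integer-coded attributes such as chest pain type would otherwise look numeric.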

Modeling: Boosted Trees Classifier

ROC Curves
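As a reminder of what an ROC curve measures, the sketch below computes the false positive rate and true positive rate at each score threshold, along with the area under the curve. It assumes scikit-learn; the labels and predicted probabilities are hypothetical toy values, not results from the model in this article.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) point per decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print("AUC:", auc)  # 0.75 for this toy example
```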

Confusion Matrix

A confusion matrix tabulates the predicted classes against the true classes, which makes the performance of an algorithm easy to visualize. Note that, because of the size of the data, we do not provide a cross-validation evaluation here, although that type of evaluation is generally preferred.
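A small sketch of how a confusion matrix is computed and read, assuming scikit-learn. The label vectors are hypothetical toy values, not predictions from the model in this article.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = heart disease absent, 1 = heart disease present.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
print(cm)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy:", accuracy)  # 0.75 for this toy example
```

The diagonal entries count correct predictions, so the off-diagonal cells show directly whether the errors are mostly false positives or false negatives.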

Boosted Trees Classifier with $l_1$ regularization (Lasso)

Lasso (least absolute shrinkage and selection operator) was introduced in the context of the method of least squares. It alters the model-fitting process so that only a subset of the provided covariates is used in the final model rather than all of them, which can improve the prediction accuracy and interpretability of regression models.
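For reference, the least-squares form of the lasso can be written as follows. The notation is an assumption made here, not taken from the article: $y_i$ is the response, $x_i$ the covariate vector of sample $i$, $\beta$ the coefficient vector of length $p$, $n$ the number of samples, and $\lambda \ge 0$ the regularization strength.

$$
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \left\{ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - x_i^{\top} \beta \right)^2 + \lambda \sum_{j=1}^{p} \left| \beta_j \right| \right\}
$$

The $l_1$ penalty $\lambda \sum_j |\beta_j|$ can drive some coefficients exactly to zero, which is how the subset selection described above arises; larger values of $\lambda$ typically leave fewer nonzero coefficients.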

ROC Curves

Confusion Matrix

Boosted Trees Classifier with $l_2$ regularization (Ridge)

ROC Curves

Confusion Matrix

